base loss
raised by multiple reviewers and next respond to individual questions
We thank all the reviewers for their feedback and pointers to relevant papers. This includes (Kendall et al., 2018), where they learn ... (Kendall et al., 2018), we consider different loss functions on the same output space. There are specific reasons we did not use several multi-task learning algorithms mentioned by REV4 as baselines. Kendall et al. (2018) assume that all base losses are applications of the same function (maximum likelihood in this case). We do not see how this method can be extended to our scenario, where base losses do not necessarily ... Moreover, our regularization is of a very different nature. However, directly normalizing the base losses was sufficient for our experiments.
Flexible risk design using bi-directional dispersion
Many novel notions of "risk" (e.g., CVaR, tilted risk, DRO risk) have been proposed and studied, but these risks are all at least as sensitive as the mean to loss tails on the upside, and tend to ignore deviations on the downside. We study a complementary new risk class that penalizes loss deviations in a bi-directional manner, while having more flexibility in terms of tail sensitivity than is offered by mean-variance. This class lets us derive high-probability learning guarantees without explicit gradient clipping, and empirical tests using both simulated and real data illustrate a high degree of control over key properties of the test loss distribution incurred by gradient-based learners.
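To make the contrast in this abstract concrete, the following is a minimal sketch of the well-known risks it names (mean, mean-variance, CVaR, tilted risk), computed on a finite sample of losses. It does not implement the paper's new risk class, which the abstract does not define. All function names, the `tail_fraction` and `weight` parameters, and the toy loss samples are assumptions made for illustration.

```python
import math


def mean_risk(losses):
    """Plain average loss."""
    return sum(losses) / len(losses)


def variance(losses):
    """Population variance: penalizes deviations on both sides of the mean."""
    m = mean_risk(losses)
    return sum((x - m) ** 2 for x in losses) / len(losses)


def mean_variance_risk(losses, weight=1.0):
    """Mean plus a weighted variance term (weight is a hypothetical knob)."""
    return mean_risk(losses) + weight * variance(losses)


def cvar_risk(losses, tail_fraction=0.1):
    """Empirical CVaR: average of the worst `tail_fraction` of the losses."""
    # Small tolerance guards against floating-point overshoot before ceil.
    k = max(1, math.ceil(tail_fraction * len(losses) - 1e-9))
    worst = sorted(losses, reverse=True)[:k]
    return sum(worst) / k


def tilted_risk(losses, t=1.0):
    """Tilted risk (1/t) * log(mean(exp(t * loss))), for t != 0."""
    scaled = [t * x for x in losses]
    peak = max(scaled)  # log-sum-exp shift for numerical stability
    total = sum(math.exp(s - peak) for s in scaled)
    return (peak + math.log(total / len(losses))) / t


# Two toy samples that differ only in the direction of one deviation.
downside = [0.0] + [1.0] * 9  # one unusually small loss
upside = [2.0] + [1.0] * 9    # one unusually large loss
```

On these toy samples, `cvar_risk` only looks at the largest loss, so it does not react to the deviation in `downside`, while `variance` is the same for both samples. This is the kind of one-directional versus bi-directional behavior the abstract describes.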
Student-Teacher Learning from Clean Inputs to Noisy Inputs
Hong, Guanzhe, Mao, Zhiyuan, Lin, Xiaojun, Chan, Stanley H.
Feature-based student-teacher learning, a training method that encourages the student's hidden features to mimic those of the teacher network, is empirically successful in transferring knowledge from a pre-trained teacher network to the student network. Furthermore, recent empirical results demonstrate that the teacher's features can boost the student network's generalization even when the student's input sample is corrupted by noise. However, there is a lack of theoretical insight into why and when this method of transferring knowledge can succeed between such heterogeneous tasks. We analyze this method theoretically using deep linear networks, and experimentally using nonlinear networks. We identify three factors vital to the success of the method: (1) whether the student is trained to zero training loss; (2) how knowledgeable the teacher is on the clean-input problem; (3) how the teacher decomposes its knowledge in its hidden features. Lack of proper control over any of the three factors leads to failure of the student-teacher learning method.
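As an illustration of the feature-matching idea described above, here is a minimal sketch of a per-sample training objective for a two-layer linear student that sees a noisy input while the teacher sees the clean one. The two-layer architecture, the squared-error losses, the `feature_weight` parameter, and all function names are assumptions made for this sketch; the abstract does not specify the authors' exact objective.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def student_teacher_loss(student_w1, student_w2, teacher_w1,
                         clean_x, noisy_x, target, feature_weight=1.0):
    """Task loss plus a penalty pulling student features toward the teacher's.

    The teacher's hidden features come from the clean input; the student's
    hidden features and output come from the noisy input.
    """
    teacher_hidden = matvec(teacher_w1, clean_x)
    student_hidden = matvec(student_w1, noisy_x)
    student_output = matvec(student_w2, student_hidden)

    task_loss = squared_distance(student_output, target)
    feature_loss = squared_distance(student_hidden, teacher_hidden)
    return task_loss + feature_weight * feature_loss


# Hypothetical toy setup: identity weights, noise added to one coordinate.
identity = [[1.0, 0.0], [0.0, 1.0]]
clean_input = [1.0, 2.0]
noisy_input = [1.0, 3.0]
loss_value = student_teacher_loss(identity, identity, identity,
                                  clean_input, noisy_input,
                                  target=clean_input, feature_weight=0.5)
```

Setting `feature_weight` to zero recovers plain supervised training on noisy inputs, which makes it easy to compare the two regimes in a small experiment.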